This summary was written by Wendy Warr based on Jean-Claude Bradley's talk at the Spring 2008 ACS meeting in New Orleans.
See more of Wendy Warr's reports at http://www.warr.com

Cheminformatics in Open Notebook Science
Jean-Claude Bradley, Department of Chemistry, Drexel University, 3141 Chestnut Street, Philadelphia, PA 19104, bradlejc@drexel.edu

Recently there has been a movement towards making the scientific process more open. Bradley described a spectrum of research tools from closed to open, from the traditional, unpublished laboratory notebook, to the traditional journal article (which still omits many elements, such as the failed experiments), to an open access journal article, to a fully transparent open notebook source. The spectrum in teaching runs from a traditional paper textbook and face-to-face lectures, to public lecture notes, to public assigned problems, to archived public lectures and free online textbooks.

The UsefulChem project in chemistry (http://usefulchem.wikispaces.com) is an open source science project, using Web 2.0 tools, led by the Bradley Laboratory at Drexel University. Since all laboratory experimental results are made public, the work is also described as Open Notebook Science. Bradley reported on a project designed to report publicly the ongoing research of a group working on the development of anti-malarial and anti-tumour agents. The project makes use of free hosted tools as much as possible so that the infrastructure can be easily replicated by other research groups. InChIs, InChIKeys and compound names are used as tags on blog and wiki pages to facilitate indexing on common search engines. The handling of large libraries and interfacing with online databases is generally accomplished with SMILES lists. Substructure searching and annotation are handled by ChemSpider. JSpecView is used to manipulate JCAMP-DX spectra over a browser interface.

One of the greatest benefits to Bradley of doing Open Notebook Science has been to find some excellent collaborators. Rajarshi Guha of Indiana University has been doing docking for Bradley’s synthetic chemistry team, and other workers have tested compounds. For example, Phil Rosenthal of the University of California, San Francisco, has been testing compounds for anti-malarial activity. Bradley’s current talk was given in the context of anti-malarial agents.

On the blog, milestones or larger problems are typically reported. There are not any hard core experiments in the blog because that would be very monotonous. Blog entries cover topics that are more interesting to a broader audience. Bradley showed a screen shot of work targeting the enzyme falcipain-2 (Phil Rosenthal is testing compounds in this field) and he clicked on a link to EXP150, going to the wiki (http://usefulchem.wikispaces.com/Exp150) that is the laboratory notebook of how those compounds are actually made. He pointed out a compound from an Ugi reaction and a picture of its crystal.

Someone might have come upon this long page by doing a Google search. The page has a complete explanation of the experiment, including the molecules. Bradley clicked on one molecule and was taken to the ChemSpider entry for that compound. (See the earlier talk by Tony Williams.) There are also links to the experimental plan and to the docking procedure. Bradley pointed out Guha’s docking results. By clicking on the results links, Bradley ended up with a Google Doc that had a list of SMILES in order of docking score versus falcipain-2. Here were the top ten compounds that someone should perhaps consider synthesising.
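
As an illustration of how such a ranked list might be reused, the following is a minimal Python sketch that reads a spreadsheet export and keeps the best-scoring entries. The column names, the SMILES and the scores are invented placeholders, not Guha’s results, and the assumption that a lower score is better depends on the docking program used.

```python
import csv
import io

# Hypothetical CSV export of a ranked spreadsheet (SMILES plus docking score).
# The molecules and scores are invented placeholders, not real docking results.
SAMPLE_CSV = """smiles,score
CCO,-5.2
c1ccccc1,-7.9
CC(=O)O,-4.1
CCN,-6.3
"""


def top_candidates(csv_text, n=10, lower_is_better=True):
    """Return the n best (smiles, score) pairs from a CSV export."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Sort ascending when lower scores are better, descending otherwise.
    rows.sort(key=lambda r: float(r["score"]), reverse=not lower_is_better)
    return [(r["smiles"], float(r["score"])) for r in rows[:n]]


ranked = top_candidates(SAMPLE_CSV, n=3)
for smiles, score in ranked:
    print(smiles, score)
```

Because the list is public, anyone could run a script of this kind against it without asking the authors for a file.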

In the procedure section the information is written in such a way that it could be quickly copied and pasted when the work is submitted for publication in a traditional article. On part of the same page there is a results section, and here there is a link to an NMR spectrum in JCAMP format. Robert Lancashire’s free JSpecView program is used to handle any spectra in JCAMP format in a browser. The spectrum has raw data behind it; peaks can be expanded, and coupling constants calculated. Compare this with the sort of less useful spectrum stored as a PDF in a supplementary information section. The JCAMP format has been put to other uses. For example, one of Bradley’s students has written Excel VBA to monitor reactions and automate kinetics analysis. The start time of the experiment is input and the program calculates the different concentrations over time.
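
The student’s VBA code is not reproduced in this report. Purely to illustrate the idea, here is a Python sketch under stated assumptions: each spectrum carries a timestamp, one reactant peak and one product peak have been integrated, the integrals are proportional to concentration, and reactant converts to product one-to-one. The function name, the dates and the integrals are all hypothetical.

```python
from datetime import datetime


def concentrations_over_time(start, readings, initial_conc):
    """Turn timed peak integrals into (minutes, [reactant], [product]) rows.

    readings: list of (timestamp, reactant_integral, product_integral).
    Assumes integrals scale with concentration and 1:1 conversion.
    """
    series = []
    for stamp, reactant, product in readings:
        elapsed_min = (stamp - start).total_seconds() / 60.0
        fraction_left = reactant / (reactant + product)
        series.append((
            elapsed_min,
            initial_conc * fraction_left,
            initial_conc * (1.0 - fraction_left),
        ))
    return series


# Invented example data: start time plus two spectra taken 30 minutes apart.
start = datetime(2008, 4, 1, 9, 0)
readings = [
    (datetime(2008, 4, 1, 9, 30), 3.0, 1.0),
    (datetime(2008, 4, 1, 10, 0), 2.0, 2.0),
]
series = concentrations_over_time(start, readings, initial_conc=0.5)
for minutes, reactant_conc, product_conc in series:
    print(minutes, reactant_conc, product_conc)
```

The resulting rows could then be fitted to a rate law; that step is omitted here because the report does not say which model was used.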

Bradley uses ChemSpider to characterise compounds. The spectra he showed were “approved” ones, i.e., the spectra of the final isolated compounds. In practice, impure or unintended compounds are also made and their spectra must be accounted for, but currently ChemSpider is used to store only the “best” compounds: those that would be submitted to a traditional journal. Bradley is also working with Andy Lang in using Second Life to display NMR data.

All the details of the experiment are on one very long page, but there has to be a log, so that the researchers can actually construct the rest of the experiment based on the log. A proper log is absolutely critical. When Bradley says that his team’s experiments are in real-time, what he means is that the log has to be online by the end of the day. The other sections of the experiment may take weeks to be uploaded. Readers do not have to take Bradley’s word for it that a reaction yield was 59%: they can go back and reinvestigate every single aspect and the arguments that were made.

Retrieving compounds in experiments has been a challenge. Bradley has used a tag section: at the very bottom of each blog or wiki page, there are tags. He chose not to use SMILES as tags because there are multiple SMILES for a given compound. InChIs worked well for small molecules, but for very large molecules, such as Bradley’s Ugi products, the InChIs are not indexed properly by Google. So, recently, Bradley has started to use InChIKeys. He uses ChemSpider to generate those InChIKeys from either the InChI or the SMILES submitted to it. Bradley showed how a Google search for a partial InChIKey came up with all of the different experiments where that compound was used. Using Google Custom Search, all of Bradley’s blogs and wikis, and all the pages that his team has generated, can be searched by a special Google search that will look only at Bradley’s approved pages. All this is available free of charge for anyone to do.
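
The partial-key lookup can be mimicked locally. The Python sketch below is only an illustration: the page names are hypothetical, and the tags are present-day standard InChIKeys for ethanol and water, used as sample data rather than compounds from UsefulChem. It uses a simple prefix match, which is an assumption about how a partial key is entered and not a description of how Google indexes the tags.

```python
# Hypothetical page tags: page name -> InChIKeys listed at the bottom of the page.
# Keys are sample data (ethanol and water), not UsefulChem compounds.
PAGE_TAGS = {
    "Exp-A": ["LFQSCWFLJHTTHZ-UHFFFAOYSA-N"],
    "Exp-B": ["LFQSCWFLJHTTHZ-UHFFFAOYSA-N", "XLYOFNOQVPJJNP-UHFFFAOYSA-N"],
    "Exp-C": ["XLYOFNOQVPJJNP-UHFFFAOYSA-N"],
}


def pages_for_partial_key(page_tags, partial):
    """Return the sorted names of pages tagged with a key starting with `partial`."""
    partial = partial.strip().upper()
    return sorted(
        page
        for page, keys in page_tags.items()
        if any(key.startswith(partial) for key in keys)
    )


# Searching on the first hyphen-separated block finds every page using the compound.
hits = pages_for_partial_key(PAGE_TAGS, "LFQSCWFLJHTTHZ")
print(hits)
```

Unlike SMILES, where one compound can be written in several ways, a fixed-length key gives a single string to match, which is what makes this kind of exact-text lookup workable.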

How are people finding Bradley’s experiments through Google? They could be looking for specific compounds, for example looking for the NMR of TFA. Or they might input a molecular formula. They could be asking “tell me everything that you know about guanidine” and they would certainly find hits in ChemSpider. They might be looking for experimental conditions. Searching for side reactions of amines, say, in the traditional literature is not likely to be useful, yet a typical laboratory notebook is almost all failures. A user searching for kinetics of Boc deprotection in UsefulChem will be able to find lots of kinetics analysis. For teachers the site also offers chemistry videos for free downloading and 3D periodic tables. Other people are looking for bigger pictures, like lysosomal targets, cheminformatics, and project proposals. Experiments can also be searched by the traditional table of contents file.

To use all the information in a much more meaningful way, it ought to be possible to compare different experiments. This might be best done with database technology but currently Bradley is using Google Docs because they are simple and there are not a lot of results to track. At some point the information should be imported into a real database.

One of the things that Bradley and co-workers observed in Ugi reactions is that sometimes they get a precipitate and sometimes they do not. A precipitate is desirable because the reaction can then be scaled up without chromatographic complications. So the team has been trying to predict which Ugi product is actually going to precipitate. They put everything into a publicly available table on which people are free to run models. Guha has built models to predict compounds that Bradley has not tried to make yet. Mesa Analytics has also started to predict compounds that should precipitate. It would be good to have ways of comparing models to each other in a very systematic way, keeping the effort fully open, so that it is very easy for people to participate and collaborate with the Bradley laboratory.
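
One simple way to compare models systematically is to score each against the same table of observed outcomes. The Python sketch below assumes a yes/no precipitation outcome per reaction; the reaction identifiers, the two models and their predictions are invented for illustration and do not represent the Guha or Mesa Analytics models.

```python
def score_model(predicted, observed):
    """Score boolean precipitation predictions against observed outcomes.

    Only reactions present in both dictionaries are counted.
    """
    shared = sorted(set(predicted) & set(observed))
    tp = sum(1 for k in shared if predicted[k] and observed[k])
    fp = sum(1 for k in shared if predicted[k] and not observed[k])
    fn = sum(1 for k in shared if not predicted[k] and observed[k])
    tn = len(shared) - tp - fp - fn
    return {
        "n": len(shared),
        "accuracy": (tp + tn) / len(shared) if shared else None,
        "precision": tp / (tp + fp) if tp + fp else None,
        "recall": tp / (tp + fn) if tp + fn else None,
    }


# Invented data: True means a precipitate was observed or predicted.
observed = {"rxn-1": True, "rxn-2": False, "rxn-3": True, "rxn-4": False}
model_a = {"rxn-1": True, "rxn-2": False, "rxn-3": False, "rxn-4": False}
model_b = {"rxn-1": True, "rxn-2": True, "rxn-3": True, "rxn-4": True}

scores = {name: score_model(m, observed)
          for name, m in {"model_a": model_a, "model_b": model_b}.items()}
for name, result in scores.items():
    print(name, result)
```

Reporting precision and recall alongside accuracy matters here because a model that predicts a precipitate for every reaction can look acceptable on one measure and poor on another.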

Bradley’s students have been keeping logs, but there are no rules for those logs except that everything that is done must be recorded. Students are not required to use specific keywords. Therefore it is difficult to convert the logs into a format that a machine could use: you could not readily extract the data and put them into a database. Recently Bradley and co-workers have been rewriting their logs in a format that should be machine-readable.

Bradley showed a workflow where words such as “add” and “vortex” have defined meanings and certain parameters are specified. For example, the molecule is specified using the InChIKey, but common names are also specified, because they are human readable. If somebody wanted to convert the InChIKeys to SMILES, they could do that easily: they would just scrape the information and then convert it.
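
The report does not give the syntax Bradley’s team adopted, so the following Python sketch uses a made-up line format (an action word followed by semicolon-separated `key=value` parameters) to show how a controlled vocabulary makes a log parseable. The field names, the required-parameter table and the sample steps are assumptions; the InChIKey shown is that of ethanol, used as sample data.

```python
# Hypothetical vocabulary: each action word and the parameters it must carry.
REQUIRED = {
    "add": {"name", "inchikey", "amount", "unit"},
    "vortex": {"duration_s"},
}

# Invented log lines in the made-up format.
LOG = """
add; name=ethanol; inchikey=LFQSCWFLJHTTHZ-UHFFFAOYSA-N; amount=1.0; unit=mL
vortex; duration_s=15
"""


def parse_step(line):
    """Parse one log line into a dict, rejecting unknown or incomplete steps."""
    verb, *fields = [part.strip() for part in line.split(";") if part.strip()]
    if verb not in REQUIRED:
        raise ValueError(f"unknown action: {verb!r}")
    params = {}
    for field in fields:
        key, value = field.split("=", 1)
        params[key.strip()] = value.strip()
    missing = REQUIRED[verb] - params.keys()
    if missing:
        raise ValueError(f"{verb}: missing {sorted(missing)}")
    return {"action": verb, **params}


steps = [parse_step(line) for line in LOG.strip().splitlines()]
for step in steps:
    print(step)
```

Keeping both the common name and the InChIKey in each `add` step mirrors the point above: the name is for the human reader, the key is what a script would scrape and convert.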

Results can be examined and compared. Individual results from experiments are stripped out and left to stand on their own. For example, the researcher would mix compounds together and then wait four hours and take a picture. That is not the whole experiment; it is just the first data point. Then a picture will be taken at eight hours, and at 12 hours. If something bad happens at the 12th hour, such as the student dropping the sample, this would be called an aborted experiment, but every individual result is still addressable and can be mined. Somebody who may not be interested in the Ugi reaction may just be interested in all the reactions where an amine has reacted with an aldehyde, and would be able to find the required information. A user might ask “What happens if I mix A and B together and wait four hours?”
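
A small data structure shows what “addressable” individual results could look like. This Python sketch is an assumption-laden illustration: the `Observation` fields, the experiment names, the reagent labels and the notes are all invented, and the query simply matches on a set of reagents and an elapsed time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    experiment: str              # parent experiment (hypothetical names)
    reagents: frozenset          # everything mixed together
    hours: float                 # elapsed time when the observation was made
    note: str                    # what was seen (invented text)
    experiment_aborted: bool = False


OBSERVATIONS = [
    Observation("Exp-X", frozenset({"amine", "aldehyde", "acid", "isocyanide"}),
                4, "clear solution", experiment_aborted=True),
    Observation("Exp-X", frozenset({"amine", "aldehyde", "acid", "isocyanide"}),
                8, "precipitate observed", experiment_aborted=True),
    Observation("Exp-Y", frozenset({"amine", "aldehyde"}),
                4, "colour change"),
]


def what_happens(observations, reagents, hours):
    """Find observations where at least `reagents` were mixed, at `hours`."""
    wanted = frozenset(reagents)
    return [o for o in observations
            if wanted <= o.reagents and o.hours == hours]


# "What happens if I mix an amine and an aldehyde and wait four hours?"
answers = what_happens(OBSERVATIONS, {"amine", "aldehyde"}, 4)
for obs in answers:
    print(obs.experiment, obs.note, obs.experiment_aborted)
```

The four-hour result from the aborted experiment is still returned, with its status attached, so the reader can decide how much weight to give it.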

When Bradley first started on the UsefulChem project, he actually used the blog to record experiments, but it turned out that the blog was not the best tool for this because if you change an entry, you have no record of it. You cannot tell if it was changed, and you do not know who changed it. With the wiki, you can see all the recent changes. Thus in EXP150, shown earlier, you can see all the different versions and who made the changes. You can compare any two versions and, using Wikispaces, the new data show up in green and the data that were deleted show up in red.

The wiki can be used as a means of organising results, and to explain failures: something that typically you do not make public. For example, Bradley and co-workers tried to make DOPAL. They eventually did make it but their initial failures were really interesting. UsefulChem records the story and gives links to the papers Bradley tried to use that had wrong information.

UsefulChem also uses a mailing list which facilitates collaboration between groups. Bradley has also started to tap into Collaborative Drug Discovery (CDD – see the paper by Bunin elsewhere in this report). Drexel University now has users on CDD and has found two compounds weakly active against falcipain-2.

Another advantage of making all of this information public is that Bradley gets contacted by people he really did not expect to contact him. Brent Friesen at Dominican University was interested in having his sophomore students do something more interesting than repeating experiments that have been repeated for 20 years, and he thought that the Ugi reaction might be an answer. So he contacted Bradley. Friesen has now written a manual for Chem 254 involving the Ugi reaction and his students will be doing new reactions and will be getting those compounds tested against malaria.

Other people are doing Open Notebook Science. Gus Rosania’s group at the University of Michigan is collaborating with Bradley, studying drug transport in the parasite and red blood cells. Cameron Neylon at Southampton has also been doing Open Notebook Science, using a modified blog instead of a wiki.

Bradley closed with some observations on where science is heading. We are getting to the stage where human beings can actually collaborate with machines if the human beings choose to make information available to them. Eventually we will get to the point where machines can actually do real science, formulating hypotheses, testing them, analysing the results, and then planning the next experiment. If this is to happen, we need to have free services; we need to have a possibility that anybody in the world can write a script that will try to process information and produce something useful.